Add `collapseHz()` #307

brownag · 2024-02-24T17:57:27Z

Here is another draft idea for a common operation people want to perform on SoilProfileCollections. This is a minimal implementation but I think several other options could greatly enhance the ways it could be used. It is likely it could be used to resolve some issues in soilDB such as ncss-tech/soilDB#120, ncss-tech/soilDB#122, ncss-tech/soilDB#135

collapseHz() function makes it easy to aggregate adjacent horizons (collapsing them into a single horizon) using groups based on pattern matching. Numeric properties are aggregated via weighted average, and other properties are aggregated via dominant condition (i.e. the aggregate value is the value with greatest cumulative thickness in each group)

TODO/additional ideas

Custom matching function (not necessarily grepl() based, like thicknessOf() / Add thicknessOf() #306)
Custom numeric/categorical aggregation functions
- Ability to specify multiple summary statistics to be calculated per column (e.g. range, or standard deviation, number of subhorizons, number of NA values, or similar)
- Concatenation and/or thickness tabulation of categories rather than dominant as the only option
Ability to exclude certain numeric properties from numeric aggregation (e.g. color value/chroma should not be averaged)
Integrate with generalized horizon labels (via GHL())--in lieu of specified pattern and possibly change default for hzdesgn when GHL() is set?

Example behavior

# testing collapseHz
library(aqp)
#> This is aqp 2.0.3
data(loafercreek, package = "soilDB")

x <- loafercreek[c(5,6,7,10,11),]

# collapse on hzdesgnname
a <- collapseHz(x, c(`A` = "^A", `BA` = "BA|Bw", `Bt` = "[ABC]+t", `Cr` = "Cr|R"))
profile_id(a) <- paste0(profile_id(a), "_collapse")
c(a, x) |> 
  plot(color = "clay")

# collapse on texture class groups / custom hzdesgnname
b <- collapseHz(x, 
                pattern = c(`l` = "^(l|sl)$", `c` = "^(cl|c|sicl|scl)$", `si` = '^si|sil$'),
                hzdesgn = "texcl")
profile_id(b) <- paste0(profile_id(b), "_collapse")
c(b, x) |> 
  plot(color = "texcl", name = "texcl")

dylanbeaudette · 2024-02-24T23:51:41Z

This is a great idea, and perfect extension to the GHL foundation of ideas and methods. I'm surprised that we never included it before.

From the texture class example above, it isn't clear to me how a dominant condition can be applied, when there is a (I think) full mapping of new labels to REGEX-matched patterns. Does the dominant condition come into play when an unmatched horizon label is encountered?

Good call on excluding color, perhaps a reasonable place to use simulated mixtures. Or, it could be that someone is only interested in Munsell value, and in that case a wt. mean is probably fine.

brownag · 2024-02-25T00:18:46Z

This is a great idea, and perfect extension to the GHL foundation of ideas and methods. I'm surprised that we never included it before.

Yeah, agreed, its something I feel like we have talked a lot about but just never have implemented a generic function for.

From the texture class example above, it isn't clear to me how a dominant condition can be applied, when there is a (I think) full mapping of new labels to REGEX-matched patterns. Does the dominant condition come into play when an unmatched horizon label is encountered?

Dominant condition is based on the thickest horizon's value within each "adjacency" group. There is no aggregation applied to an unmatched horizon label--it just passes through to the result. As currently implemented this is modifying the hzdesgn column in place, which could be the generalized horizon label column or in these examples is the hzdesgnname().

I need to ponder a bit on this as far as what the right ergonomics are. Probably better to have the function calculate a new horizon designation column and set it as the generalized horizon label, or horizon designation label, or both? Rather than modifying existing data--which might be important for downstream context.

Good call on excluding color, perhaps a reasonable place to use simulated mixtures. Or, it could be that someone is only interested in Munsell value, and in that case a wt. mean is probably fine.

Yeah, I think I want to keep the function relatively "dumb" but specific columns could be targeted with specific methods by either calling reduceSPC() on the input before hand, and then a custom aggregation function (TODO), or alternately an argument that allows exclusion of a vector of some column names from the transformation applied to the horizon data frame, or forces categorical aggregation

brownag · 2024-02-25T02:38:59Z

I must note that this concept is implemented in part as dissolve_hz() by @smroecker (

aqp/R/segment.R

Line 275 in fcad3e8

    
           dissolve_hz <- function(object, by, id = "peiid", hztop = "hzdept", hzbot = "hzdepb", collapse = FALSE, order = FALSE) {

) -- but that function does not operate on SoilProfileCollection objects.

I forgot about that function and reinvented part of the wheel here, I suppose. It is nice that dissolve_hz() can operate on several grouping variables, which is something that could be considered here as well, but it would behave differently (i.e. collapseHz() would alway use the interaction of the specified grouping variables to determine unique adjacent horizon groups, rather than return separate aggregations for each factor). dissolve_hz(collapse=TRUE) appears to use the intersection of grouping variables specified in by.

collapseHz() function intent seeks to be more generic (when TODOs are complete), and also does aggregation of properties across the horizon data.frame in one call, rather than just calculating depths.

dylanbeaudette · 2024-05-16T21:42:18Z

A note for later: one of the simplest cases is flattening on an intended horizon index, such as chkey or phiid, with a short-circut for 99% of the cases where no flattening is required.

- implement ignoring specific numeric columns with `ignore_numerics - add argument`AGGFUN` for column name specific aggregations - default for categories is to returndominant condition rather than just thickest layer

…per profile, add example

dylanbeaudette

Great addition. I'd like to add one more test since there is quite a lot of complexity here. GHL assignment → collapsing horizons is going to be really useful.

R/collapseHz.R

dylanbeaudette · 2024-10-11T16:02:16Z

R/collapseHz.R

+      h <- h[order(h[[idn]], h[[hzd[1]]]),]
+
+      # replace horizons in parent SPC
+      replaceHorizons(x) <- h


Should we retain original horizon IDs and set new ones? This level of accounting could be useful, but adds clutter. I'd vote for new horizon IDs and dump the old ones for simplicity.

Should we retain original horizon IDs and set new ones? This level of accounting could be useful, but adds clutter.

I don't think it is a good idea to dump the horizon IDs and calculate new ones. This could create unintended conflicts with the auto-calculated sequential integer IDs.

I think it is not desirable for the calculated hzID nor one set by the user e.g. chkey or chiid. Essentially I want to retain reference to the source horizon. I don't see an issue with current behavior that keeps a subset of the parent hzidcolumn() values. Can you elaborate on what you are thinking here?

Maybe I missed something: when two horizons are collapsed what is the resulting horizon ID?

The horizon ID (character data type) should be from the thickest horizon in the group. It is not handled differently than any other horizon-level variable by default, but you can supply a custom aggregation function (to e.g. concatenate IDs) if desired

brownag · 2024-10-11T16:21:32Z

Great addition. I'd like to add one more test since there is quite a lot of complexity here. GHL assignment → collapsing horizons is going to be really useful.

Thanks. What else would you like to test?

dylanbeaudette · 2024-10-11T17:32:54Z

Great addition. I'd like to add one more test since there is quite a lot of complexity here. GHL assignment → collapsing horizons is going to be really useful.

Thanks. What else would you like to test?

Some degenerate cases, and an additional hand-verified example. I can add those after it is merged.

brownag · 2024-10-11T18:35:58Z

Great addition. I'd like to add one more test since there is quite a lot of complexity here. GHL assignment → collapsing horizons is going to be really useful.

Thanks. What else would you like to test?

Some degenerate cases, and an additional hand-verified example. I can add those after it is merged.

I've tested this on some fetchNASIS() results with multiple labsampnum per phiid using by="phiid". Go ahead and add the tests to this PR, I'm not in a rush to merge it if there are other items we still need to check.

dylanbeaudette · 2024-10-11T19:51:33Z

Testing out PR, and I see that many numeric columns are converted to character:

data("jacobs2000", package = "aqp")
x <- jacobs2000

a_pattern <- c(`A` = "^A",
               `E` = "E", 
               `Bt` = "[B]+t", 
               `Bh` = "[B]+h", 
               `C` = "^C", 
               `foo` = "bar")

m <- collapseHz(jacobs2000, pattern = a_pattern)
profile_id(m) <- paste0(profile_id(m), "_c")

str(horizons(m))

'data.frame':	28 obs. of  18 variables:
 $ id                   : chr  "92-1_c" "92-1_c" "92-1_c" "92-1_c" ...
 $ name                 : chr  "A" "E" "Bt" "B/C" ...
 $ sand                 : chr  "85" "80" "51" "41" ...
 $ clay                 : chr  "7" "6" "41" "50" ...
 $ matrix_color         : chr  "#675F56FF" "#B1904AFF" "#BF8B37FF" NA ...
 $ time_saturated       : chr  "0" "0" "0" "0" ...
 $ top                  : num  0 18 43 153 156 0 18 46 145 0 ...
 $ bottom               : num  18 43 153 156 213 18 46 145 213 25 ...
 $ matrix_color_munsell : chr  "10YR 4/1" "2.5Y 6/6" "10YR 6/8" NA ...
 $ depletion_code       : chr  NA NA "f2" NA ...
 $ depletion_munsell    : chr  NA NA "10YR 7/3" NA ...
 $ depletion_color      : chr  NA NA "#C4AC8CFF" NA ...
 $ concentration_code   : chr  NA NA "f2" NA ...
 $ concentration_munsell: chr  NA NA "2.5YR 4/8" NA ...
 $ concentration_color  : chr  NA NA "#964C26FF" NA ...
 $ depletion_pct        : chr  "0" "0" "0" "0" ...
 $ concentration_pct    : chr  "0" "0" "2" "0" ...
 $ hzID                 : chr  "1" "2" "4" "6" ...

This only happens sometimes:

# no aggregation required, data are left as-is
m <- collapseHz(jacobs2000, by = 'name')

# numeric data converted to character 
m <- collapseHz(jacobs2000, by = 'matrix_color_munsell')

dylanbeaudette · 2024-10-11T20:17:03Z

Additional tests I would like to add:

checking hand-calculated wt. means with and without NA,
selection of most common category with and without NA
degenerate horizon logic -- maybe we error and ask user to fix first
missing values in by

brownag · 2024-10-11T20:21:57Z

Testing out PR, and I see that many numeric columns are converted to character:

Good catch!! Fixed in 489d2a8

There were two things: a typo/not replaced variable name that changed last night, which caused the numeric aggregation to not work at all, and also there was some implicit conversion to matrix going on (where first column of matrix was character so all others were) that I also introduced last night.

The reason it only happens sometimes is the "aggregation" code was not triggered if all of the labels are unique within profiles, essentially the "short circuit" you refer to here: #307 (comment)

… dominant condition

brownag · 2024-10-11T21:13:16Z

Additional tests I would like to add:

checking hand-calculated wt. means with and without NA,

selection of most common category with and without NA

degenerate horizon logic -- maybe we error and ask user to fix first

missing values in by

Thanks, I added tests and checks for averages and dominant condition with and without NA, and missing values in by in 5bcac07, 268497d, 1a84883

With the dominant condition aggregation, rather than thickest, NA values were always implicitly removed, though depths/thicknesses were handled. That should be fixed now.

I think I would rather not implement tests of horizon logic within this function, or at least have the option to run it without checking. I think the aggregation will generally work "correctly" even with missing depths, overlaps, gaps, though the results may not be sensible. I added some tests for the a couple common cases (filled profiles, 0 profile SPC, by column not in horizon data.frame) in 2902a8a

One might use something like this function to fix certain types of logic errors by crafting either a pattern or column indicating things that should be combined.

brownag · 2024-10-11T21:57:06Z

80dc40f adds a basic screening function that just checks for any NA top or bottom depths and for bottom depths shallower than top. A warning directs the user to use checkHzDepthLogic() (same as depths<-() method)

…llapseHz

dylanbeaudette · 2024-10-15T05:07:45Z

Looks good to me, merge away!

brownag added 3 commits February 24, 2024 09:37

Add collapseHz()

ee45cf2

Add test

a5c22e3

fun

71de71c

brownag mentioned this pull request May 15, 2024

get_chorizon_from_SDA() returning duplicate horizons ncss-tech/soilDB#348

Closed

brownag added 9 commits October 10, 2024 15:50

Merge branch 'master' into collapseHz

5d1586b

implement custom matching function argument FUN

7b30975

update aggregation methods

494f917

- implement ignoring specific numeric columns with `ignore_numerics - add argument`AGGFUN` for column name specific aggregations - default for categories is to returndominant condition rather than just thickest layer

update for using existing GHL(), refine logic for multiple matches …

5339a61

…per profile, add example

add test + docs

729e7c3

allow multiple summary statistics in custom aggregation

17ed949

implement simplified route for existing group IDs or labels

21c3ad8

add test of by argument and custom AGGFUN with data.frame results

9c3b432

doc

f01ffe2

brownag marked this pull request as ready for review October 11, 2024 14:39

dylanbeaudette approved these changes Oct 11, 2024

View reviewed changes

add comment about when aggregation is used

0098010

move horizon extraction and replacement outside the loop

424ebbe

fix issue with aggregated colums always returning character + test

489d2a8

brownag added 2 commits October 11, 2024 13:54

add weighted average and dominant condition tests, with and without NA

5bcac07

NA in by argument not allowed

268497d

brownag added 2 commits October 11, 2024 13:55

handle NA values (when na.rm=FALSE) in aggregation of thickness for…

1a84883

… dominant condition

tests of empty SPC, filled SPC, missing by column

2902a8a

add depth screening function to guide user to checkHzDepthLogic()

80dc40f

dylanbeaudette and others added 5 commits October 11, 2024 15:04

test context

d9a567b

Update NEWS

73c848b

Create collapseHz-mixMunsell-examples.R

23b67f3

Merge branch 'collapseHz' of https://github.com/ncss-tech/aqp into co…

a7afe03

…llapseHz

Merge branch 'master' into collapseHz

70fc053

brownag merged commit 8b31518 into master Oct 15, 2024
5 checks passed

brownag deleted the collapseHz branch October 15, 2024 15:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add `collapseHz()` #307

Add `collapseHz()` #307

brownag commented Feb 24, 2024 •

edited

Loading

dylanbeaudette commented Feb 24, 2024

brownag commented Feb 25, 2024

brownag commented Feb 25, 2024 •

edited

Loading

dylanbeaudette commented May 16, 2024

dylanbeaudette left a comment

dylanbeaudette Oct 11, 2024

brownag Oct 11, 2024 •

edited

Loading

dylanbeaudette Oct 11, 2024

brownag Oct 11, 2024

brownag commented Oct 11, 2024

dylanbeaudette commented Oct 11, 2024

brownag commented Oct 11, 2024

dylanbeaudette commented Oct 11, 2024

dylanbeaudette commented Oct 11, 2024

brownag commented Oct 11, 2024

brownag commented Oct 11, 2024

brownag commented Oct 11, 2024

dylanbeaudette commented Oct 15, 2024

Add collapseHz() #307

Add collapseHz() #307

Conversation

brownag commented Feb 24, 2024 • edited Loading

Example behavior

dylanbeaudette commented Feb 24, 2024

brownag commented Feb 25, 2024

brownag commented Feb 25, 2024 • edited Loading

dylanbeaudette commented May 16, 2024

dylanbeaudette left a comment

Choose a reason for hiding this comment

dylanbeaudette Oct 11, 2024

Choose a reason for hiding this comment

brownag Oct 11, 2024 • edited Loading

Choose a reason for hiding this comment

dylanbeaudette Oct 11, 2024

Choose a reason for hiding this comment

brownag Oct 11, 2024

Choose a reason for hiding this comment

brownag commented Oct 11, 2024

dylanbeaudette commented Oct 11, 2024

brownag commented Oct 11, 2024

dylanbeaudette commented Oct 11, 2024

dylanbeaudette commented Oct 11, 2024

brownag commented Oct 11, 2024

brownag commented Oct 11, 2024

brownag commented Oct 11, 2024

dylanbeaudette commented Oct 15, 2024

Add `collapseHz()` #307

Add `collapseHz()` #307

brownag commented Feb 24, 2024 •

edited

Loading

brownag commented Feb 25, 2024 •

edited

Loading

brownag Oct 11, 2024 •

edited

Loading